Conversation
Summary of ChangesHello @smarthg-gi, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request establishes a new, automated pipeline for importing a wide array of education-related statistics for Pennsylvania. The initiative aims to enrich the data platform with detailed county-level and institutional education metrics, facilitating a deeper understanding of educational trends and demographics within the state. The implemented solution ensures that the data can be regularly updated with minimal manual intervention. Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request adds a new data import for Pennsylvania education statistics. The changes include a download script, processing scripts, configuration files, and test data. I've identified several areas for improvement, primarily in the documentation and data mapping files, to ensure correctness and clarity. Key issues include incorrect commands and file paths in the README, typos and malformed entries in the manifest and CSV files, and potentially non-standard enum values in a pvmap file. Addressing these points will improve the robustness and maintainability of the import.
| 210748,ipedsId/210748,Altoona Beauty School Inc,,,,,, | ||
| 210775,ipedsId/210775,Alvernia University,,,,,, | ||
| 211088,ipedsId/211088,Arcadia University,,,,,, | ||
| 211006,ipedsId/211006,Automotive Training Center-Exton,,,,,,` |
| Westmoreland,geoId/42129 | ||
| Wyoming,geoId/42131 | ||
| York,geoId/42133 | ||
| Statewide [cite: Data], No newline at end of file |
There was a problem hiding this comment.
This line is malformed. It seems to be an incomplete entry for "Statewide" and contains a note [cite: Data] which should not be in the data. This line will likely be ignored or cause an error during processing. Please either complete the entry with a valid dcid or remove the line. Also, the file is missing a newline at the end.
| Pennsylvania State University-Penn State Shenango,ipedsId/214634 | ||
| Pennsylvania State University-World Campus,ipedsId/479956 | ||
| Pennsylvania College of Technology,ipedsId/366252 | ||
| The Pennsylvania State University, |
| **Run the following command:** | ||
|
|
||
| ```bash | ||
| sh run.sh |
| python3 stat_var_processor.py | ||
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/educational_attainment_by_age_range_and_gender/*.csv" | ||
| --pv_map=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_pvmap.csv" | ||
| --config_file=../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv" | ||
| --output_path=../../statvar_imports/pennsylvania/pennsylvania_education/output_files/educational_attainment_by_age_range_and_gender_output" | ||
| --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf | ||
| --places_resolved_csv=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_places_resolver.csv" | ||
| ``` |
There was a problem hiding this comment.
This command has two issues:
- The
--input_dataargument uses a glob pattern (*.csv), which is inconsistent with therun_processing.shscript. It should be the specific filename. - It incorrectly includes a
--places_resolved_csvargument. This dataset does not have a correspondingplaces_resolved.csvfile, and therun_processing.shscript does not use this argument for this dataset.
| python3 stat_var_processor.py | |
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/educational_attainment_by_age_range_and_gender/*.csv" | |
| --pv_map=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_pvmap.csv" | |
| --config_file=../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv" | |
| --output_path=../../statvar_imports/pennsylvania/pennsylvania_education/output_files/educational_attainment_by_age_range_and_gender_output" | |
| --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf | |
| --places_resolved_csv=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_places_resolver.csv" | |
| ``` | |
| python3 stat_var_processor.py \ | |
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/educational_attainment_by_age_range_and_gender.csv" \ | |
| --pv_map=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_pvmap.csv" \ | |
| --config_file=../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv" \ | |
| --output_path=../../statvar_imports/pennsylvania/pennsylvania_education/output_files/educational_attainment_by_age_range_and_gender_output" \ | |
| --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf |
|
|
||
| ```bash | ||
| python3 stat_var_processor.py | ||
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/public_school_enrollment_by_county_grade_and_race/*.csv" |
There was a problem hiding this comment.
The --input_data argument uses a glob pattern (*.csv), which is inconsistent with the run_processing.sh script and the download script which creates a single file per dataset. This should be the specific filename.
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/public_school_enrollment_by_county_grade_and_race/*.csv" | |
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/public_school_enrollment_by_county_grade_and_race.csv" |
|
|
||
| ```bash | ||
| python3 stat_var_processor.py | ||
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/undergraduate_stem_enrollment/*.csv" |
There was a problem hiding this comment.
The --input_data argument uses a glob pattern (*.csv), which is inconsistent with the run_processing.sh script and the download script which creates a single file per dataset. This should be the specific filename.
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/undergraduate_stem_enrollment/*.csv" | |
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/undergraduate_stem_enrollment.csv" |
| "import_name": "Pennsylvania_Education", | ||
| "curator_emails": ["[email protected]"], | ||
| "provenance_url": "https://data.pa.gov/", | ||
| "provenance_description": "Dataset related to the pennsylvania's Education at country level.", |
There was a problem hiding this comment.
There are a couple of typos in the provenance_description. "country" should be "county", and "pennsylvania's" should be "Pennsylvania's".
| "provenance_description": "Dataset related to the pennsylvania's Education at country level.", | |
| "provenance_description": "Dataset related to Pennsylvania's Education at county level.", |
| Engineering Technologies and Engineering-Related Fields,enrollmentLevel,Graduate,bachelorsDegreeMajor,EngineeringMajor,,,,,,,, | ||
| Physical Sciences,enrollmentLevel,Graduate,bachelorsDegreeMajor,PhysicalSciences,,,,,,,, | ||
| Mathematics and Statistics,enrollmentLevel,Graduate,bachelorsDegreeMajor,MathAndStatisticsMajor,,,,,,,, | ||
| Biological and Biomedical Sciences,enrollmentLevel,Graduate,bachelorsDegreeMajor,BiologicalAndBiomedicalSciencesMajor,,,,,,,, | ||
| Science Technologies/Technicians,enrollmentLevel,Graduate,bachelorsDegreeMajor,ScienceAndTechnologiesMajor,,,,,,,, |
There was a problem hiding this comment.
Some of the values for bachelorsDegreeMajor seem to be non-standard. Please check if these are correct Data Commons enums:
EngineeringMajor(line 4) should likely beEngineering.MathAndStatisticsMajor(line 6) should likely beMathematicsAndStatistics.ScienceAndTechnologiesMajor(line 8) should likely beScienceAndTechnology.
Using incorrect enum values will lead to issues during data import.
Engineering Technologies and Engineering-Related Fields,enrollmentLevel,Graduate,bachelorsDegreeMajor,Engineering,,,,,,,,
Physical Sciences,enrollmentLevel,Graduate,bachelorsDegreeMajor,PhysicalSciences,,,,,,,,
Mathematics and Statistics,enrollmentLevel,Graduate,bachelorsDegreeMajor,MathematicsAndStatistics,,,,,,,,
Biological and Biomedical Sciences,enrollmentLevel,Graduate,bachelorsDegreeMajor,BiologicalAndBiomedicalSciencesMajor,,,,,,,,
Science Technologies/Technicians,enrollmentLevel,Graduate,bachelorsDegreeMajor,ScienceAndTechnology,,,,,,,,
No description provided.